Neural coding of a natural stimulus ensemble: 
Uncovering information at sub— millisecond resolution 
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Our knowledge of the sensory world is encoded by neurons in sequences of discrete, identical pulses 
termed action potentials or spikes. There is persistent controversy about the extent to which the 
precise timing of these spikes is relevant to the function of the brain. We revisit this issue, using the 
motion-sensitive neurons of the fly visual system as a test case. New experimental methods allow 
us to deliver more nearly natural visual stimuli, comparable to those which flies encounter in free, 
acrobatic flight, and new mathematical methods allow us to draw more reliable conclusions about 
the information content of neural responses even when the set of possible responses is very large. 
We find that significant amounts of visual information are represented by details of the spike train 
at millisecond and sub-millisecond precision, even though the sensory input has a correlation time 
of ~ 60 ms; different patterns of spike timing represent distinct motion trajectories, and the absolute 
timing of spikes points to particular features of these trajectories with high precision. Under these 
naturalistic conditions, the system continues to transmit more information at higher photon flux, 
even though individual photoreceptors are counting more than one million photons per second, and 
removes redundancy in the stimulus to generate a more efficient neural code. 



I. INTRODUCTION 

Throughout the brain, information is represented by 
discrete electrical pulses termed action potentials or 
'spikes' 1]. For decades there has been controversy about 
the extent to which the precise timing of these spikes is 
significant: should we think of each spike arrival time as 
having meaning down to millisecond precision @, S 01 j 
or does the brain only keep track of the number of spikes 
occurring in much larger windows of time? Is precise tim- 
ing relevant only in response to rapidly varying sensory 
stimuli, as in the auditory system Q, or can the brain 
construct specific patterns of spikes with a time resolu- 
tion much smaller than the time scales of the sensory 
and motor signals that these patterns represent [1, @? 
Here we address these issues using the motion-sensitive 
neurons of the fly visual system as a model Q • 

We bring together new experimental methods for deliv- 
ering truly naturalistic visual inputs [8( and new math- 
ematical methods that allow us to draw more reliable 
inferences about the information content of spike trains 
S Q3 EH- We find that as we improve our time res- 
olution for the analysis of spike trains from 2 ms down 
to 0.2 ms we reveal nearly one-third more information 
about the trajectory of visual motion. The natural stim- 
uli used in our experiments have essentially no power 
above 30 Hz, so that the precision of spike timing is not 
a necessary correlate of the stimulus bandwidth; instead 
the different patterns of precise spike timing represent 
subtly different trajectories chosen out of the stimulus 
ensemble. Further, despite the long correlation times of 
the sensory stimulus, segments of the neural response 



separated by ~ 30 ms provide essentially independent in- 
formation, suggesting that the neural code in this system 
achieves decorrelation [l2l [l3| in the time domain. This 
decorrelation is not evident in the time dependent spike 
rate alone, but the time scale for the independence of 
information does match the time scale on which visual 
motion signals are used to guide behavior [l6j |. 



II. POSING THE PROBLEM 

Flies exhibit a wide variety of visually guided behav- 
iors, of which perhaps the best known is the optomotor 
response, in which visual motion drives a compensating 
torque, stabilizing straight flight This system of- 

fers many advantages for the exploration of neural coding 
and computation: there is a small groups of identified, 
wide-field motion-sensitive neurons [7( that provide an 
obligatory link in the process [l5j], and it is possible to 
make very long, stable recordings from these neurons as 
well as to characterize in detail the signal and noise prop- 
erties of the photoreceptors that provide the input data 
for the computation. In free flight, the trajectory of vi- 
sual motion is determined largely by the fly's own motion 
through the world, and there is a large body of data on 
flight behavior under natural conditions [id . UtI [H, Qj| , 
offering us the opportunity to generate stimuli that ap- 
proximate those experienced in nature. But the natural 
visual world of flies involves not only the enormous an- 
gular velocities associated with acrobatic flight; natural 
light intensities and the dynamic range of their varia- 
tions also are very large, and the fly's compound eyes 
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FIG. 1: Neural responses to a natural stimulus ensemble. At left we show a schematic of the experimental setup (see Methods 
for details). A fly is immobilized with wax, its body in a plastic tube, with its head protruding. Through a small hole in the 
back of the head an electrode is inserted to record extracellular potentials from HI, a wide field neuron sensitive to horizontal 
motion. This signal is amplified, fed through a slip ring system to a second stage amplifier and filter, and recorded by a data 
acquisition card. In synchrony with its master timer clock, the DAQ card generates a 500 Hz frame clock signal. Every 2 ms, 
through a bidirectional parallel port, this clock triggers a successive read of a divisor value from a file stored in the stimulus 
laptop computer. The Intel 8254 Counter/Timer chip uses this divisor value to divide down the pulse frequency of a free 
running 8 MHz clock. In this way, in each successive 2 ms interval, and in strict synchrony with the data taking clock, a defined 
and evenly spaced burst of pulses is produced. These pulses drive the stepper motor, generating the angular velocity signal. A 
brief segment of this motion stimulus is shown in the top right panel, below which we plot a raster of action potentials from 
HI in response to 100 repetitions of this stimulus. At bottom we expand the scale to illustrate (at left) that individual spikes 
following a transition from negative to positive velocity jitter from trial to trial by ~ 1 ms: the standard deviations of spike 
times shown here are 0.72ms for the first spike (•), 0.81ms for the second spike (o), and 1.22ms for the third spike (x). When 
we align the first spikes in this window, we see (at right) that the jitter of interspike intervals is even smaller, 0.21 ms for the 
first interval and 0.69 ms for the second interval. Our challenge is to quantify the information content of such precise responses. 
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are stimulated over more than 2ir steradians; all of these 
features are difficult to replicate in the laboratory [20| . 
As an alternative, we have moved our experiments out- 
side Q, so that flies experience the scenes from the re- 
gion in which they were caught. We record from a single 
motion-sensitive cell, HI, while we rotate the fly along 
trajectories that are modeled on the natural flight tra- 
jectories (see Methods for details). For other approaches 
to the delivery of naturalistic stimuli in this system see 

m, 

A schematic of our experiment, and an example of 
the data we obtain, are shown in Fig [T] We see qual- 
itatively that the responses to natural stimuli are very 
reproducible, and we can point to specific features of 
the stimulus — such as reversals of motion direction — that 
generate individual spikes and interspike intervals with 
better than millisecond precision. The challenge is to 
quantify these observations: do precise and reproducible 
patterns of spikes occur just at some isolated moments, 
or does looking at the spike train with higher time resolu- 
tion generally provide more information about the visual 
input? 

Precise spike timing endows each neuron with a huge 
"vocabulary" of responses [J, 0], but this potential ad- 
vantage in coding capacity creates challenges for exper- 
imental investigation. If we look with a time resolution 
of t = 1 ms, then in each bin of size r we can see either 
zero or one spike; across the behaviorally relevant time 
scale of 30 ms the neural response thus can be described 
as a 30-bit binary word, and there are 2 30 , or roughly 
one billion such words. Although some of these responses 
never occur (because of refractoriness) and others are ex- 
pected to occur only with low probability, it is clear that 
if precise timing is important then neurons can generate 
many more meaningfully distinguishable responses than 
the number that we can sample in realistic experiments. 

Can we make progress on assessing the content and 
meaning of neural responses even when we can't sam- 
ple all of them? Some hope is provided by the classical 
problem of how many people need to be present in a 
room before there is a reasonable chance that they share 
a birthday. This number, N ~ 23, is vastly less than 
the number of possible birthdays, K = 365. Turning 
this argument around, if we didn't know the number of 
possible birthdays we could estimate it by polling N peo- 
ple and checking the frequency of coincidences. Once N 
is large enough to generate several coincidences we can 
get a pretty good estimate of K, and this happens when 
N ~ \f~K <C K. Some years ago Ma proposed that this 
coincidence counting method be used to estimate the en- 
tropy of physical systems from molecular dynamics or 
Monte Carlo simulations [12] (see also Ref [23[). If these 
arguments could be generalized, it would become feasi- 
ble to estimate the entropy and information content of 
neural responses even when experiments provide only a 
sparse sampling of these responses. The results of Ref 
[3, fiol ] provide such a generalization. 

To understand how the methods of Ref @ generate 
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FIG. 2: Systematic errors in entropy estimation. We consider 
a coin with unknown probability p of coming up heads; from 
JV coin flips we try to estimate the entropy S = — p log 2 p — 
(1 — p) log 2 (l — p); see Methods for details of the calculations. 
At left, we make Bayesian estimates starting from the prior 
hypothesis that all values of p are equally likely, V(p) = 1. 
We show how the best estimate S' differs from the true value 
So when this deviation is measured in units of the estimated 
error bar SS (posterior standard deviation). For small num- 
bers of samples, the best estimate is systematically in error 
by more than two times the size of the error bar, so we would 
have false confidence in a wrong answer, even at intermediate 
values of the entropy which are most relevant for real data. At 
right, we repeat the same procedure but with a prior hypoth- 
esis that all possible value of the entropy are equally likely, 
V(S) = 1. Systematic errors still appear, but they are more 
nearly compatible with the error bars, even at small N, and 
especially in the range of entropies which is relevant to our 
experiments. 



more accurate entropy estimates from small samples, it 
is useful to think about the simpler problem of flipping 
a coin under conditions where we don't know the prob- 
ability p that it will come up heads. One strategy is to 
count the number of heads nu that we see after N flips, 
and identify p = uh/N; if we then use this "frequentist" 
or maximum likelihood estimate to compute the entropy 
of the underlying distribution, it is well known that we 
will underestimate the entropy systematically [24l. l25l . l26l ]. 
Alternatively, we could take a Bayesian approach and say 
that a priori all values of < p < 1 are equally likely; 
the standard methods of Bayesian estimation then will 
generate a mean and an error bar for our estimate of the 
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entropy given N observations. As shown in Fig [2j this 
procedure actually leads to a systematic overestimate of 
the entropy in cases where the real entropy is not near its 
maximal value. More seriously, this systematic error is 
larger than the error bars that emerge from the Bayesian 
analysis, so we would be falsely confident in the wrong 
answer. 

Figure [2] also shows us that if we use a Bayesian ap- 
proach with the a priori hypothesis that all values of the 
entropy are equally likely, then (and as far as we know, 
only then) we find estimates such that the systematic 
errors are comparable to or smaller than the error bars, 
even when we have seen only one sample. Thus the prob- 
lem of systematic errors in entropy estimation is not, as 
one might have thought, the problem of not having seen 
all the possibilities; the problem rather is that seemingly 
natural and unbiased prior hypotheses about the nature 
of the underlying probabilities correspond to highly bi- 
ased hypotheses about the entropy itself, and this prob- 
lem gets much worse when we consider distributions over 
many alternatives. The strategy of Ref @ thus is to 
construct, at least approximately, a 'flat prior' on the en- 
tropy (see Methods for details). The results of Ref [ll| 
demonstrate that this procedure actually works for both 
simulated and real spike trains, where 'works' means that 
we generate estimates that agree with the true entropy 
within error bars even when the number of samples is 
much smaller than the number of possible responses. As 
expected from the discussion of the birthday problem, 
what is required for reliable estimation is that the num- 
ber of coincidences be significantly larger than one [To| . 



III. WORDS, ENTROPY AND INFORMATION 

Armed with tools that allow us to estimate the entropy 
of neural responses, we first analyze a long experiment 
in which the fly experiences a continuous trajectory of 
motion with statistics modeled on those of natural flight 
trajectories (FigO see Methods for details). As shown in 
Fig [4^,, we examine segments of the response of duration 
T, and we break these segments into discrete bins with 
time resolution r. For sufficiently small r each bin either 
has one or zero spikes, and hence the response becomes a 
binary word with T/r bits, while in the opposite limit we 
can let t — T and then the response is the total number 
of spikes in a window of size T; for intermediate values 
of r the responses are multi-letter words, but with larger 
than binary alphabet when more than one spike can occur 
within a single bin. An interesting feature of these words 
is that they occur with a probability distribution similar 
to the distribution of words in English (Zipf's law; Fig 
|4Jd). This Zipf-like behavior emerges only for T > 20 ms, 
and was not observed in experiments with less natural, 
noisy stimuli 

With a fixed value of T, improving our time resolution 
(smaller r) means that we distinguish more alternatives, 
increasing the "vocabulary" of the neuron. Mathemati- 
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FIG. 3: Constructing a naturalistic stimulus, (a) Digitized 
version of original video tracking data by Land and Collett 
The panel shows traces of a leading fly (blue) and a 
chasing fly (green). Successive points along the trajectories 
are recorded at 20 ms intervals. Every tenth point along each 
trajectory is indicated by a number. From these traces we 
estimate rotational velocities of the body axis by calculating 
the angular change in orientation of the trajectory from one 
point in the sequence to the next, and dividing by 20 ms. The 
result of this calculation for the leading fly is shown in panel 
(b). (c) From these data (on both flies) we construct a joint 
distribution, P(14, Vk+i), of successive velocities taken 20 ms 
apart, (d) Short sample of a trajectory constructed using the 
distribution in (c) as a Markov process, and then interpolat- 
ing the velocity trace to 2 ms resolution, (e) Probability den- 
sities of angular velocity generated from this Markov process 
(black dashed line) and scaled down by a factor of two (black 
line) to avoid destabilizing the experiment; distributions are 
symmetric and we show only positive velocities. For compar- 
ison we show (red line) the distribution of angular velocities 
recorded for head motion of Calliphora during episodes of sac- 
cadic turning [19j. (f) Power spectrum of synthesized velocity 
signal, demonstrating the absence of power above 30 Hz. (g) 
As in (e) but for the accelerations. Note that the distribution 
of our synthesized and scaled signal is surprisingly close to 
that found for saccadic head motions. 



cally this means that the entropy S(T, r) of the neural 
responses is larger, corresponding to a larger capacity 
for carrying information. This is shown quantitatively in 
FigHt, where we plot the entropy rate, S(T,t)/T. The 
question of whether precise spike timing is important in 
the neural code is precisely the question of whether this 
capacity is used by the system to carry information 0, 0] • 
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To estimate the information content of the neural re- 
sponses, we follow the strategy of Refs 0, HtJ- Roughly 
speaking, the information content of the 'words' gener- 
ated by the neuron is less than the total size of the neural 
vocabulary because there is some randomness or noise in 
the association of words with sensory stimuli. To quan- 
tify this noise we choose a five second segment of the 
stimulus, and then repeat this stimulus 100 times. At 
each moment < t < 5 s in the cycle of the repeated 
stimulus, we can look across the one hundred trials to 
sample the different possible responses to the same in- 
put, and with the same mathematical methods as be- 
fore we use these samples to estimate the 'noise entropy' 
S n (T,r\t) in this 'slice' of responses. The information 
which the responses carry about the stimulus then is 
given by 7(T,r) = S(T,t) - {S n (T,r\T)) t , where (• ■ -)t 
denotes an average over time t, which implicitly is an 
average over stimuli. It is convenient to express this as 
an information rate i?; n f (T, r) = I(T,t)/T, and this is 
what we show in FigdJi, with T = 25 ms chosen to reflect 
the time scale of behavioral decisions [l|| ■ 

The striking feature of Fig [4ji is the growth of infor- 
mation rate with time resolution. We emphasize that 
this measurement is made under conditions comparable 
to those which the fly encounters in nature — outdoors, 
in natural light, moving along trajectories with statistics 
similar to those observed in free flight. Thus, under these 
conditions, we can conclude that the fly's visual system 
carries information about motion in the timing of spikes 
down to sub-millisecond resolution. Quantitatively, in- 
formation rates double as we increase our time resolution 
from t — 25 ms to r = 0.2 ms, and the final ~ 30% of 
this increase occurs between r = 2 ms and r = 0.2 ms. 
In the behaviorally relevant time windows [lj| , this 30% 
extra information corresponds to a almost a full bit from 
this one cell, which would provide the fly with the abil- 
ity to distinguish reliably among twice as many different 
motion trajectories. 



IV. WHAT DO THE WORDS MEAN? 

The information rate tells us how much we can learn 
about the sensory inputs by examining the neural re- 
sponse, but it doesn't tell us what we learn. In partic- 
ular, we would like to make explicit the nature of the 
extra information that emerges as we increase our time 
resolution from r = 2 ms to r = 0.2 ms. To do this, we 
look at particular "words" in a segment of the neural re- 
sponse, as shown in Fig. [51 and then examine the motion 
trajectories that correspond to these words [2j| . For sim- 
plicity, we consider all responses that have two spikes in 
successive 2 ms bins, that is the pattern 11 when seen at 
t = 2 ms resolution. When we improve our time reso- 
lution to t = 0.2 ms, some of these responses turn out 
to be of the form 10000000000000000001, while at the 
other extreme some of the responses have the two spikes 
essentially as close as is possible given the refractory pe- 



riod, 00000100000000100000. Remarkably, as we sweep 
through these subtly different patterns — which all have 
the same average spike arrival time but different inter- 
spike intervals — the average velocity trajectory changes 
form qualitatively, from a smooth "on" (negative to pos- 
itive velocity) transition, to a prolonged period of posi- 
tive velocity, to a more complex waveform with off and 
on transitions in succession. Examining more closely the 
distribution of waveforms conditional on the different re- 
sponses, we see that these differences among mean wave- 
forms are in fact discriminable. Thus, variations in inter- 
spike interval on the millisecond or sub-millisecond scale 
represent significantly different stimulus trajectories. 

A second axis along which we can ask about the na- 
ture of the extra information at high time resolution 
concerns the absolute timing of spikes. As an exam- 
ple, responses which at r = 2 ms resolution are of the 
form 11 can be unpacked at r = 0.2 ms resolution to 
give patterns ranging from 01000000001000000000 to 
00000000010000000010, all with the same interspike in- 
terval but with different absolute arrival times. As shown 
in Fig [5j all of these responses code for motion trajec- 
tories with two zero crossings, but the times of these 
zero crossings shift as the spike arrival times shift. Thus, 
whereas the times between spikes represent the shape 
of the waveform, the absolute arrival time of the spikes 
mark, with some latency, the time at which a specific fea- 
ture of the waveform occurs, in this case a zero crossing. 
Again we find that millisecond and sub-millisecond scale 
shifts generate discriminable differences. 

The idea that sub-millisecond timing of action po- 
tentials could carry significant information is not new, 
but the clearest evidence comes from systems in which 
the dynamics of the stimulus itself has significant sub- 
millisecond structure, as in hearing and electroreception 
[5|,[32j . Even for HI, experiments demonstrating the im- 
portance of spike timing at the ~ 2 ms level [3, [33| could 
be criticized on the grounds that the stimuli had unnatu- 
rally rapid variations. It thus is important to emphasize 
that, in these experiments, HI does not achieve millisec- 
ond precision simply because the input has a bandwidth 
of kiloHertz; in fact, the stimulus has a correlation time 
of ~ 60 ms (Fig|6l), and 99.9% of the stimulus power is 
contained below 30 Hz (Fig [3f ) . 



V. REDUNDANCY REDUCTION 

The long correlation time of these naturalistic stim- 
uli also raises questions about redundancy — while each 
spike and interspike interval can be highly informative, 
does the long correlation time of the stimulus inevitably 
mean that successive spikes carry redundant information 
about essentially the same value of the instantaneous ve- 
locity? Certainly on very short time scales this is true: 
although i?i n fo(r, t) actually increases at small T, since 
larger segments of the response reveal more informative 
patterns of several spikes [H, Hid ^ does decrease at 
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FIG. 4: Words, entropy and information in the neural response to natural signals, (a) Schematic showing how we convert 
the sequence of action potentials into discrete 'words' 0, H3]- At the top we show the stimulus and spike arrival times (red 
dots) in a 64 ms segment of the experiment. We treat this as two successive segments of duration T = 32, ms, and divide these 
segments into bins of duration r = 2, 8, or 32 ms. For sufficiently small r (here, r = 2 ms), each bin contains either zero or one 
spike, and so each neural response becomes a binary word with T/t bits; larger values of r generate larger alphabets, until at 
t = T the response of the neuron is just the spike count in the window of duration T. Note that the words are shown here as 
non-overlapping; this is just for graphical convenience, (b) The distribution of words with r = 1 ms, for various values of T; 
words are plotted in rank order. We see that, for large T (T = 40 or 50 ms) but not for small T (T = 20 ms), the distribution 
of words has a large segment in which the probability of a word is P cx 1/ rank" , corresponding to a straight line on this double 
logarithmic plot. Similar behavior is observed for words in English, with a = 1, which we show for comparison (solid line); this 
is sometimes referred to as Zipf's law [281 ] . (c) The entropy of a T = 25 ms segment of the spike train, as a function of the time 
resolution r with which we record the spikes. We plot this as an entropy rate, S(T,r)/T, in bits/s; this value of T is chosen 
because this is the time scale on which visual motion drives motor behavior [l(J ■ For comparison we show the theoretical results 
(valid at small r) for a Poisson process [lj], and a Poisson process with a refractory period [llj, with spike rates and refractory 
periods matched to the data. Note that the real spike train has significantly less entropy than do these simple models. In Ref 
[111 ] we showed that our estimation methods can recover the correct results for these models using data sets comparable in size 
to the one analyzed here; thus our conclusion that real entropies are smaller cannot be the result of undersampling. Error bars 
are smaller than the data points, (d) The information content of T = 25 ms words, as a function of time resolution r; again we 
plot this as a rate Ei n { (T, r) = I(T,t)/T, in bits/s. 



larger T, a sign of redundancy. On the other hand, the 
approach to a constant information rate happens very 
rapidly: we can measure the redundancy on time scale 
T by computing Tr(T, r) = 21 (T, t)/I(2T, t) - 1, where 
T = means that successive windows of size T provide 
completely independent information, and T = 1 means 
that they are completely redundant. As shown in Fig 
El T / (T, t) decays rapidly, on a time scale of less than 
20 ms. In contrast, correlations in the stimulus decay 



much more slowly, on the ~ 60 ms time scale. Further, 
we can compute at each moment of time the spike rate 
r(t), and this has a correlation time comparable to the 
stimulus itself, suggesting that the decorrelation of in- 
formation is more subtle than a simple filtering of the 
stimulus. 
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FIG. 5: Response conditional ensembles (2^|. We consider five different neural responses, all of which are identical when viewed 
at r = 2 ms resolution, corresponding to the pattern 11, spikes in two successive bins. At left, we consider responses which, 
at higher time resolution, correspond to different interspike intervals. At right, the interspike interval is fixed but higher time 
resolution reveals that the absolute spike arrival times differ. In each case we compute the median motion trajectory conditional 
on the high time resolution response (lines) and we indicate the width of the distribution with bars that range plus and minus 
one quartile around the median. It is clear that changes in interspike interval produce changes in the distribution of stimulus 
waveform that are discriminable, since the mid-quartiles do not overlap. Changes in absolute timing are more subtle, and so 
we estimate the conditional distributions of velocity at each moment in time using the methods of Ref [3Cj, compute the overlap 
of these distributions, and convert the result into the equivalent signal-to-noise ratio d! for discrimination against Gaussian 
noise [3l| . Note that we compute this discriminability using single points in time; d! values based on extended segments of the 
waveforms would be even higher. 



VI. BIT RATES AND PHOTON COUNTING 
RATES 

The ability of the fly's visual system to mark features 
of the stimulus with millisecond precision, even when the 
stimulus correlation time is ~ 60 ms, depends on having 
access to a representation of visual motion with very high 
signal-to-noise ratio. Previous work has suggested that 
this system can estimate motion with a precision close 
to the limits set by noise in the photoreceptor s l35l . |36| , 
which is dominated by photon shot noise [37L |38||. The 
present experiments, however, are done under very dif- 
ferent conditions: velocities of motion are much larger, 
the fly's eye is stimulated over a much larger area, and 
light intensities outdoors are much larger than gener- 
ated by laboratory displays. During the course of our 
experiments we monitor the light intensity at zenith, us- 
ing a detector matched to the properties of the fly pho- 
toreceptors (see Methods); from these measurements we 
estimate that the mean light intensity corresponds to 
1.56 x 10 6 photon/s per photoreceptor, which is near the 
limit of the photoreceptor's dynamic range for photon 
counting. Is it possible that photon counting statistics 
still are relevant even at these high rates? 

Because the experiments are done outdoors, there are 
small fluctuations in light intensity from trial to trial as 
clouds drift by and obscure the sun. Although the dy- 
namic range of these fluctuations is less than a factor 
two, the arrival times of individual spikes (e.g., the "first 
spike" after t = 1.75 s in Fig[T]) have correlation coeffi- 
cients of up to p = —0.42 with the light intensity, with 
the negative sign indicating that higher light intensities 
lead to earlier spikes. One might see this effect as a fail- 



ure of the system to adapt to the overall light intensity, 
but it also suggests that some of what we have called 
noise really represents a response to trial-by-trial vari- 
ations in stimulus conditions. Indeed, a correlation be- 
tween light intensity and spike time means that the noise 
entropy S n (T, r\t) in windows which contain these spikes 
is smaller than we have estimated because some of the 
variability can be ascribed to stimulus variation. 

More subtly, if photon shot noise is relevant, we expect 
that on trials with higher light intensity the neuron will 
actually convey more information about the trajectory of 
motion. We emphasize that this is a delicate question. 
To begin, the differences in light intensity are small, and 
we expect (at most) proportionately small effects. Fur- 
ther, as the light intensity increases, the total spike rate 
increases, and this increases both the total entropy and 
the noise entropy. To ask if the system uses the more re- 
liable signal at higher light intensities to convey more in- 
formation we have to determine which of these increases 
is larger. 

To test the effects of light intensity on information 
transmission (see Methods for details), we divide the tri- 
als into halves based on the average light intensity over 
the trial, and we try to estimate the information rates in 
both halves; the two groups of trials differ by just 3% in 
their median light intensities. Since cutting the number 
of trials in half makes our sampling problems much worse, 
we focus on short segments of the response (T = 6 ms) at 
high time resolution (r = 0.2 ms); note that these are still 
"words" with 30 letters. For this case we find that for the 
trials with higher light intensities the information about 
the motion stimulus is larger by A = 0.0204±0.0108 bits, 
which is small but significant at the 94% confidence level. 
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idea that the system performs efficiently both in the tasks 
of estimation and coding, making use of the extra signal- 
to-noise provided by increased photon flux and reducing 
the redundancy of the stimulus as it is transformed into 
spikes. Finally, we note that our ability to reach these 
conclusions depends not just on new experimental meth- 
ods that allow us to generate truly naturalistic stimuli [f| , 
but critically on new mathematical methods that allow 
us to analyze neural responses quantitatively even when 
it is impossible for us to sample the distribution of re- 
sponses exhaustively 0, [ll[ . We expect that these sorts 
of mathematical tools will become even more critical for 
neuroscience in the future. 



FIG. 6: Redundancy reduction in the time domain. We mea- 
sure the redundancy Tx(T, r) (points with error bars) be- 
tween words of length T in the neural response, as explained 
in the text. To allow exploration of large T we work at a time 
resolution r = 3 ms. The redundancy can be compared to cor- 
relations in the stimulus T„ = (v(t+T)v(t)} / {v 2 } (dotted line) 
or correlations in the spike rate T r = (5r(t + T)Sr(t))/((Sr) 2 ) 
(dashed line). Note that the redundancy decays rapidly — we 
show an exponential fit with a time constant of 17.3 ms. In 
contrast, the correlations in the stimulus the firing rate decay 
much more slowly — the solid line, for comparison, shows an 
exponential decay with a time constant of 53.4 ms. Correla- 
tions in spike rate are calculated from a separate experiment 
on the same cell, with 200 repetitions of a 10 s stimulus drawn 
from the same distribution, that generates more accurate es- 
timates of r(t). 
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We find differences with the same sign for all accessible 
combinations of T and r, and the overall significance of 
the difference thus is much larger. Note that since we are 
analyzing T — 6 ms windows, this difference corresponds 
to AR ~ 3 bits/s, 1-2% of the total (cf FigEJ). Thus even 
at rates of more than one million photons per second per 
receptor cell, small increases in photon flux produce sig- 
nificant changes in the transmission of information about 
the visual input. 



VII. CONCLUSION 

To summarize, we have found that under natural stim- 
ulus conditions the fly visual system generates spikes and 
interspike intervals with extraordinary temporal preci- 
sion. As a consequence, the neural response carries a 
substantial amount of information that is available only 
at sub-millisecond time resolution. At this high resolu- 
tion, absolute spike timing is informative about the time 
at which particular stimulus features occur, while differ- 
ent interspike intervals provide a rich representation of 
distinguishable stimulus features. These results provide 
a clear demonstration that the visual system uses sub- 
millisecond timing to provide a richer representation of 
the natural sensory world, at least in this corner of the 
fly's brain. In addition, the data provide support for the 



APPENDIX A: METHODS 

Neural recording and stimulus generation. HI 

was recorded extracellular ly by a short (12 mm shank 
length) tungsten electrode (FHC). The signal was pream- 
plified by a differential bandpass amplifier based on the 
INAlll. After amplification by a second stage samples 
were digitized at 10 kHz by an AD converter (National 
Instruments DAQCard-AI-16E-4, mounted in a Field- 
works FW5066P ruggedized laptop). In off line analysis, 
the analog signal was digitally filtered by a template de- 
rived from the average spike waveform. Spikes were then 
time stamped by interpolating threshold crossing times. 
The ultimate precision of this procedure is limited by the 
signal to noise ratio in the recording; for typical condi- 
tions this error is estimated to be 50 — 100 fis. Note that 
we analyze spike trains down to a precision of r = 200 /is, 
so that some saturation of information at this high time 
resolution may actually result from instrumental limi- 
tations. The experiments were performed outside in a 
wooded environment, with the fly mounted on a step- 
per motor with vertical axis. The speed of the stepper 
motor was under computer control, and could be set at 
2 ms intervals. The DAQ card generates a clock signal 
at 500 Hz in synchrony with the master clock which cal- 
ibrates the neural recording. As explained in the legend 
to Fig [TJ each tick of the clock drives the stepper motor 
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through an amount determined by reading the stimulus 
file stored on a dedicated computer. The motor (SIG- 
Positec RDM566/50 stepper motor, 104 pulses per rev- 
olution) is driven by a controller (SIG-Positec Divistep 
D331.1), which in turn receives pulses at a frequency di- 
vided down from a free running 8 MHz clock; the stimulus 
velocity is represented by the divisor for the pulse fre- 
quency. In this way, the stepper motor is driven in each 
2 ms period, in strict synchrony with the data acquisition 
clock, by steps that are evenly spaced. This design was 
chosen to minimize the effects of discrete steps and to 
maximize the reliability of all timing measurements. To 
stabilize temperature the setup was enclosed by a trans- 
parent plexiglass cylinder (radius 15 cm, height 28 cm), 
with a transparent plexiglass lid. 

Monitoring light intensity and controlling tem- 
perature. The air temperature in the experimental en- 
closure was regulated by a Peltier element fitted with 
heat vanes and fans on both sides for efficient heat dis- 
persal, driven by a custom built feedback controller. The 
temperature could be set over a range from approxi- 
mately five degrees below to fifteen degrees above am- 
bient temperature, and the controller stabilized temper- 
ature over this range to within about a degree. In the 
experiments described here, temperature was 23 ± 1° C. 
An overall measure of light intensity was obtained by 
monitoring the current of a photodiode (Hamamatsu) 
enclosed in a diffusing ping pong ball. The photodiode 
signal was amplified by a logarithmic amplifier operating 
over five decades. The photodiode was located ~ 50 cm 
from the fly, and in the experiments the setup was al- 
ways placed in the shade. The photodiode measurement 
was intended primarily to get a rough impression of rela- 
tive light intensity fluctuations. However, to relate these 
measurements to outside light levels, before the start of 
each experiment a separate calibration measurement of 
zenith radiance was taken using a calibrated light inten- 
sity meter. To relate this measurement to fly physiology, 
the radiance reading was converted to an estimated ef- 
fective fly photoreceptor photon rate. The reading of the 
photodiode was roughly proportional to the zenith inten- 
sity reading, with a proportionality factor determined by 
the placement of the setup and the time of day. To ob- 
tain a practical rule of thumb, the photodiode readings 
were converted to equivalent zenith photon flux values, 
using the current to zenith intensity conversion factor es- 
tablished at the beginning of the experiment. During the 
experiments the photodiode current was sampled at 1 s 
intervals. 

Repeated stimuli. In their now classical experi- 
ments, Land and Collett measured the trajectories of flies 
in free flight [l(| ; in particular they reported the angular 
position (orientation) of the fly vs time, from which we 
can compute the angular velocity v(t). The short seg- 
ments of individual trajectories shown in the published 
data have a net drift in angle, so we include both the 
measured v(t) and —v(t) as parts of the stimulus. We 
use the trajectories for the two different flies in Fig 4 



of Ref [HI, and graft all four segments together, with 
some zero padding to avoid dramatic jumps in velocity, 
generating a stimulus that is 5 seconds in duration and 
has zero drift so that repetition of the angular velocity 
vs time also repeats the angular position vs time. Since 
Land and Collett report data every 20 ms, we interpolate 
to generate a signal that drives the stepper motor at 2 ms 
resolution; interpolation is done using the MATLAB rou- 
tine interp, which preserves the bandlimited nature of 
the original signal and hence does not distort the power 
spectrum. 

Nonrepeated stimulus. To analyze the full entropy 
of neural responses, it is useful to have a stimulus that 
is not repeated. We would like such a stimulus to match 
the statistical properties of natural stimulus segments de- 
scribed above. To do this, we estimate the probability 
distribution P[v(t + At)\v(t)] from the published trajec- 
tories, where At = 20 ms is the time resolution, and then 
use this as the transition matrix of a Markov process 
from which we can generate arbitrarily long samples; our 
nonrepeated experiment is based on a 990 s trajectory 
drawn in this way. The resulting velocity trajectories 
will, in particular, have exactly the same distributions of 
velocity and acceleration as in the observed free flight tra- 
jectories. Although the real trajectories are not exactly 
Markovian, our Markovian approximation also captures 
other features of the natural signals, for example gener- 
ating a similar number of velocity reversals per second. 
Again we interpolate these trajectories to obtain a stim- 
ulus at 2 ms resolution. 

Entropy estimation in a model problem. The 
problem in Fig [2] is that of a potentially biased coin. 
Heads appear with probability p, and the probability of 
observing n heads out of N flips is 

P N (n\p)ocp n (l-p) N ~ n . (Al) 

If we observe n and try to infer p, we use Bayes' rule to 
construct 

P N (p\n) = P N (n\p)^\ cx V(p)p n (l - p) N - n , (A2) 
P N (n) 

where P(p) is our prior and P^{n) = dp Pn (n\p)V (p) . 
Given this posterior distribution of p we can calculate the 
distribution of the entropy, 

S(p) = -plog 2 (p) - (1 -f>)]og a (l-p). (A3) 

We proceed as usual to define a function g(S) that is the 
inverse of S(p), that is g(S(p)) = p; since p and 1 — p 
give the same value of S, we choose < g < 0.5 and let 
g(S) = 1 - g(S). Then we have 

P N (S\n) = [P N (p = g(S)\n) +P N (p = g(S)\n)} ^ . 

_(A4) 

From this distribution, we can estimate a mean S^in) 
and a variance cr 2 (n, N) in the usual way. What interests 
us is the difference between Sjy(n) and the true entropy 
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S(p) associated with the actual value of p characterizing 
the coin; it makes sense to measure this difference in units 
of the standard deviation SS(n, N). Thus we compute 



N 



((S'-S )/5S) = J2 p N(n\p) 



S N (n) - S(p) 
6S(n,N) 



(A5) 



and this is what is shown in Fig[2j We consider two cases. 
First, a flat prior on p itself, so that V(p) = 1. Second, 
a flat prior on the entropy, which corresponds to 



V(p) 



dS(p) 



dp 
log 2 



1-p 



(A6) 
(A7) 



Note that this prior is (gently) diverging near the limits 
p = and p = 1, but all the expectation values that we 
are interested in are finite. 

Entropy estimation: General features. Our dis- 
cussion here follows Refs 0, [ll[ very closely. Consider a 
set of possible neural responses labeled by i = 1, 2, • ■ • , K. 
The probability distribution of these responses, which we 
don't know, is given by p = {p{\. A well studied fam- 
ily of priors on this distribution is the Dirichlct prior, 
parameterized by /?, 



1 



Z(p-K) 



K 



13-1 



K 



(A8) 



Maximum likelihood estimation, which identifies proba- 
bilities with frequencies of occurrence, is obtained in the 
limit (3 — ► 0, while /3 — * 1 is the natural "uniform" prior. 
When K becomes large, almost any p chosen out of this 
distribution has an entropy S = — p\ log 2 p\ very close 
to the mean value, 



(A9) 



where ipo(%) = d\og 2 T(x)/dx, and T(x) is the gamma 
function. We therefore construct a prior which is ap- 
proximately flat on the entropy itself by a continuous 
superposition of Dirichlet priors, 



p(p) 



v m^iv M , (mo 



and we then use this prior to perform standard Bayesian 
inference. In particular, if we observe each alternative i 
to occur n\ times in our experiment, then 



K 



P({m}\p)<x]lp?, 
and hence by Bayes' rule 



i=l 



P(p\{m}) 



Hp)- 



(All) 



(A12) 



Once we normalize this distribution we can integrate over 
all p to give the mean and the variance of the entropy 
given our data {ri;}. In fact, all the integrals can be 
done analytically except for the integral over (3. Soft- 
ware implementation of this approach is available from 
http://nsb-entropy.sourceforge.net/ This basic 
strategy can be supplemented in cases where we have 
prior knowledge about the entropies. In particular, when 
we are trying to estimate entropy in "words" of increas- 
ing duration T, we know that S(T, r) < S(T', r) + S(T - 
T", t) for any T", and thus it makes sense to constrain 
the priors at T using the results from smaller windows, 
although this is not critical to our results. We obtain re- 
sults at all integer values of T/t for which our estimation 
procedure is stable (see below) and use cubic splines to 
interpolate to non-integer values as needed. 



Entropy estimation: Details for total entropy. 

There are two critical challenges to estimating the en- 
tropy of neural responses to natural signals. First, the 
overall distribution of (long) words has a Zipf-like struc- 
ture (FigdjD), which is troublesome for most estimation 
strategies and leads to biases dependent on sample size. 
Second, the long correlation times in the stimulus mean 
that, successive words 'spoken' by the neuron are strongly 
correlated, and hence it is impossible to guarantee that 
we have independent samples, as assumed implicitly in 
Eq (jAllj) . We can tame the long tails in the probability 
distribution by partitioning the space of responses, esti- 
mating entropies within each partition, and then using 
the additivity of the entropy to estimate the total. We 
investigate a variety of different partitions, including (a) 
no spikes vs. all other words, (b) no spikes, all words 
with one spike, all words with two spikes, etc., (c) no 
spikes, all words with frequencies of over 1000, and all 
other words. Further, for each partitioning, we follow p[ 
and evaluate S(T, t) for data sets of different sizes aN, 
< a < 1. Note that by choosing fractions of the data in 
different ways we can separate the problems of correla- 
tion and sample size. Thus, to check that our estimates 
are stable as a function of sample size, we choose con- 
tiguous segments of experiment, while to check for the 
impact of correlations we can 'dilute' our sampling so 
that there are longer and longer intervals between words. 
Obviously there are limits to this exploration (one can- 
not access large, very dilute samples), but as far as we 
could explore the impact of correlations on our estimates 
is negligible once the samples sizes are sufficiently large. 
For the effects of sample size we look for behavior of the 
form S(a) = Soo + Si/a + S^/a 2 and take Soo as our esti- 
mate of S(T, t), as in Ref [J]. For all partitions in which 
the the most common word (silence) is separated from 
the rest, these extrapolated estimates agree and indicate 
negligible biases at all combinations of r and T for which 
the I /a 2 term is negligible compared to the 1/a (that is, 
r > 0.5 ms at T < 25 ms). For smaller t, estimation fails 
at progressively smaller T, and to obtain an entropy rate 
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for large T we extrapolate to r/T — > using 

±S(T, r) = 5(r) + A(r/T) + B(r/T) 2 , (A13) 

where <S(t) is our best estimate of the entropy rate at res- 
olution t. All fits were of high quality, and the resulting 
error bars on the total entropy are negligible compared 
to those for the noise entropy. In principle, we could be 
missing features of the code which appear only when we 
use high resolution for very long words, but this unlikely 
scenario is almost impossible to exclude by any means. 

Entropy estimation: Details for noise entropy. 
Putting error bars on the noise entropy averaged over 
time is more difficult because these should include a con- 
tribution from the fact that our finite sample over time 
is only an approximation to the true average over the un- 
derlying distribution of stimuli. Most seriously, the en- 
tropies are very different in epochs that have net positive 
or negative velocities. Because of the way that we con- 
structed the repeated stimulus, v(t) = —v{t + To), with 
T = 2.5s; thus if we compute S n (T, r\t) +S n (T, T\t+T x ) 
with Ti ~ T), this fluctuates much less as a function of 
t than the entropy in an individual slice. Because our 
stimulus has zero mean, every slice has a partner un- 
der this shift, and the small difference between To and 
Ti takes account of the difference in latency between 
responses to positive and negative inputs. A plot of 
S n (T, r\t) + S n (T,r\t + Ti) vs time t has clear dips at 
times corresponding to zero crossings of the stimulus, 
and we partition the data at these points. We derive 
error bars on the mean noise entropy (S n (T,T\t)) t by a 
bootstrap-like method, in which we construct samples 
by randomly sampling with replacements from among 
these blocks, jittering the individual entropies S n (T, r\t) 
by the errors that emerge from the Bayesian analysis of 
individual slices. As with the total entropy we extrapo- 
late to otherwise inaccessible combinations of T and r, 
now writing 

L( Sn (T,T\t)) t = S n (T)+A( T /T) + B(r/Tf 



+Ccos(27rT/r ) (A14) 

and fitting by weighted regression. Note that results at 
different T but the same value of r are strongly corre- 
lated, and so the computation of x 2 is done using the full 
(non-diagonal) covariance matrix. The periodic term is 
important at small r, where we can see structure as the 
window size T crosses integer multiples of the average 
interspike interval, tq — 2.53 ms. Error estimates emerge 
from the regression in the standard way, and all fits had 
X 2 ~ 1 per degree of freedom. 

Impact of photon flux on information rates. 
Since there are no responses to repeated and unrepeated 
stimuli recorded at exactly the same illuminations, we 
use the data from the repeated experiment to evaluate 
both the noise entropy and the total entropy. We expect 
that we are looking for small differences, so we tighten 
our analysis by discarding the first two trials, which are 
significantly different from all the rest (presumably be- 
cause adaptation is not complete), as well as excluding 
the epochs in which the stimulus was padded with ze- 
roes. The remaining 98 trials are split into two groups 
of 49 trials each with the highest and the lowest ambi- 
ent light levels. We can then estimate the total entropy 
5^'^(T,t) for the high (h) and low (I) intensity groups 
of trials, and similarly for the noise entropy in each slice 
at time t, Sn' l \T,r\t). As noted above, assigning error 
bars is clearer once we form quantities that are balanced 
across positive and negative velocities, and we do this 
directly for the difference in noise entropies, 

AS n (T,r;t) = [S^(T, r|i) + S^(T, r\t + Ti)] 

-[S®(T,T\t)+S<P(T,T\t + T[)], 

(A15) 

where we allow for a small difference in latencies (T\—T[) 
between the groups of trials at different intensities. We 
find that AS n (T, r; t) has a unimodal distribution and a 
correlation time of ~ 1.4 ms, which allows for an easy 
evaluation of the estimation error. 
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